Algorithmic Thinking Theory

Bateni, MohammadHossein, Cohen-Addad, Vincent, Gu, Yuzhou, Lattanzi, Silvio, Meierhans, Simon, Mohri, Christopher

arXiv.org Artificial Intelligence

Initial challenges, such as grade-school mathematics (GSM8K) and standard competition math (MATH dataset), have largely been surmounted, pushing the frontier of AI reasoning toward "grand challenge" problems, such as those found in the International Mathematical Olympiad (IMO). These problems, renowned for their demand for deep insight, creativity, and rigorous proof, expose a fascinating weakness in modern LLMs. While a model's performance on a single attempt (termed pass@1) may be very low, its ability to produce a correct answer within k attempts (pass@k) can be significantly higher. This pass@1 versus pass@k gap, especially pronounced when sampling with high temperature to produce diverse outputs, suggests that models possess a vast, latent capability that is not accessible in a single, high-confidence generation. Interestingly, to recover the full power of the model it is not sufficient to simply use multiple attempts. In fact, even the pass@k metric fails to capture the full story. On the most difficult problems, simply sampling k times and selecting the best answer (e.g., "best-of-32") still yields poor results. For instance, Huang and Yang (2025) report that a best-of-32 baseline on the IMO 2025 problems achieved an accuracy of only 31.6-38.1% for leading models [HY25]. This paradox lies at the heart of our work: the latent capability of LLMs is not merely a matter of selection (finding one correct needle in a haystack of k attempts), but one of synthesis.
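For reference, pass@k is commonly estimated with the unbiased estimator used in the code-generation evaluation literature: given n sampled attempts of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Below is a minimal Python sketch; the function name and the illustrative counts are ours, and this may not be the exact evaluation protocol used by the authors.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k from n sampled attempts, c of them correct:
    # 1 - C(n - c, k) / C(n, k), i.e. one minus the probability that a random
    # subset of k attempts contains no correct answer.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustration of the pass@1 vs pass@k gap: with 2 correct answers out of 64 samples,
# pass@1 is about 0.03 while pass@32 is about 0.75.
print(pass_at_k(64, 2, 1))   # ~0.031
print(pass_at_k(64, 2, 32))  # ~0.754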


Test-time Diverse Reasoning by Riemannian Activation Steering

Khanh, Ly Tran Ho, Zhu, Dongxuan, Yue, Man-Chung, Nguyen, Viet Anh

arXiv.org Artificial Intelligence

Best-of-$N$ reasoning improves the accuracy of language models in solving complex tasks by sampling multiple candidate solutions and then selecting the best one based on some criterion. A critical bottleneck for this strategy is the output diversity limit, which occurs when the model generates similar outputs despite stochastic sampling, and hence repeats the same errors. To address this lack of variance in reasoning paths, we propose a novel unsupervised activation steering strategy that simultaneously optimizes the steering vectors for multiple reasoning trajectories at test time. At any synchronization anchor along the batch generation process, we find the steering vectors that maximize the total volume spanned by all possible intervened activation subsets. We demonstrate that these steering vectors can be determined by solving a Riemannian optimization problem over the product of spheres with a log-determinant objective function. We then use a Riemannian block-coordinate descent algorithm with a well-tuned learning rate to obtain a stationary point of the problem, and we apply these steering vectors until the generation process reaches the subsequent synchronization anchor. Empirical evaluations on popular mathematical benchmarks demonstrate that our test-time Riemannian activation steering strategy outperforms vanilla sampling techniques in terms of generative diversity and solution accuracy.
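The following is a minimal NumPy sketch of the kind of computation the abstract describes: block-coordinate Riemannian gradient ascent of a log-determinant volume objective over a product of unit spheres. The function name, the interface (a batch of hidden activations H and a steering scale alpha), the step size, the iteration count, and the ridge regularizer are illustrative assumptions, not the authors' implementation.

import numpy as np

def riemannian_logdet_steering(H, alpha=1.0, steps=50, lr=0.1, ridge=1e-6, seed=0):
    # H: (N, d) array of hidden activations, one row per reasoning trajectory.
    # Returns unit-norm steering vectors V (N, d) that locally maximize
    # log det( (H + alpha*V) (H + alpha*V)^T ), i.e. the volume spanned by the
    # steered activations, via block-coordinate Riemannian ascent on spheres.
    rng = np.random.default_rng(seed)
    N, d = H.shape
    V = rng.standard_normal((N, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    for _ in range(steps):
        for i in range(N):  # block-coordinate sweep over trajectories
            Z = H + alpha * V
            G = Z @ Z.T + ridge * np.eye(N)  # regularized Gram matrix (N x N)
            # Euclidean gradient of log det(G) w.r.t. row z_i is 2*(G^{-1} Z)[i];
            # the chain rule through z_i = h_i + alpha*v_i multiplies by alpha.
            g = 2.0 * alpha * np.linalg.solve(G, Z)[i]
            # Riemannian gradient on the sphere: project out the radial component.
            g_tan = g - (g @ V[i]) * V[i]
            # Retraction: step in the tangent direction, then renormalize.
            V[i] = V[i] + lr * g_tan
            V[i] /= np.linalg.norm(V[i])
    return V

Normalizing after each tangent-space step is the standard retraction onto the unit sphere, so every steering vector respects the product-of-spheres constraint throughout the sweep.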


Robust Exploratory Stopping under Ambiguity in Reinforcement Learning

Ye, Junyan, Wong, Hoi Ying, Park, Kyunghyun

arXiv.org Machine Learning

We propose and analyze a continuous-time robust reinforcement learning framework for optimal stopping problems under ambiguity. In this framework, an agent chooses a stopping rule motivated by two objectives: robust decision-making under ambiguity and learning about the unknown environment. Here, ambiguity refers to considering multiple probability measures dominated by a reference measure, reflecting the agent's awareness that the reference measure representing her learned belief about the environment may be erroneous. Using the $g$-expectation framework, we reformulate an optimal stopping problem under ambiguity as an entropy-regularized optimal control problem under ambiguity, with Bernoulli-distributed controls to incorporate exploration into the stopping rules. We then derive the optimal Bernoulli-distributed control, characterized by backward stochastic differential equations. Moreover, we establish a policy iteration theorem and implement it as a reinforcement learning algorithm. Numerical experiments demonstrate the convergence and robustness of the proposed algorithm across different levels of ambiguity and exploration.
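As a purely illustrative one-step analogue of entropy-regularized Bernoulli stopping (the paper itself works in continuous time via $g$-expectations and BSDEs; the symbols $a$, $b$, and $\lambda$ below are ours): if $a$ is the reward from stopping immediately, $b$ the continuation value, and $\lambda > 0$ the exploration temperature, then the optimal stopping probability has the Gibbs/sigmoid form

\[
p^{\star} \;=\; \operatorname*{arg\,max}_{p\in[0,1]} \Big\{ p\,a + (1-p)\,b + \lambda\big(-p\ln p -(1-p)\ln(1-p)\big) \Big\}
\;=\; \frac{1}{1+e^{-(a-b)/\lambda}},
\]

so the stopping rule stays strictly randomized (exploratory) for any $\lambda > 0$ and recovers the deterministic stop/continue decision as $\lambda \to 0$.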


A Algorithms

Neural Information Processing Systems

The result follows directly from Theorem 1 in Cranko et al. [2021]. Lemma 7. If Assumption 3 holds, for any ... is an eigenvector of H. Similarly, applying Hoeffding's inequality and the Kantorovich-Rubinstein theorem gives Prob(E ...). Theorem 9. Given a Bayesian network ..., we prove the statements of this theorem in several steps. To prove (a) and (b), we show that the DRO problem is strictly convex when the true non-neighbors are known, so that an optimal solution exists, and that the solution to Equation (4) with the true non-neighbor constraints is optimal; in this way, no non-neighbor nodes are recovered in the skeleton. We follow the proof of Lemma 11.2 in Hastie et al. [2015]. Having established properties (a) and (b), we show that all neighbor nodes are recovered with a threshold of β/2. We are then ready to prove (d). BIC is not applicable to skeletons. The best and runner-up results are marked in bold; significant differences are marked (paired t-test, p < 0.05). The final sample complexity becomes m = O(...).